1. Preparation

1.1 Copy all .xml, .sh, .sql, and .hql files from the TrainingOnHDP/DataManagementOnFalcon/churnanalysis folder on GitHub into /root/TrainingOnHDP/DataManagementOnFalcon/churnanalysis on your HDP sandbox

1.2 Run the following commands via SSH on your sandbox:
	
	ex -bsc '%!awk "{sub(/\r/,\"\")}1"' -cx /root/TrainingOnHDP/DataManagementOnFalcon/churnanalysis/customerIngest.sh
	
	su - hdfs
	
	hadoop fs -chmod -R 777 /user/root
	hadoop fs -mkdir /user/root/falcon
	hadoop fs -rm -skipTrash /user/root/falcon/customerIngest.sh
	hadoop fs -copyFromLocal /root/TrainingOnHDP/DataManagementOnFalcon/churnanalysis/customerIngest.sh /user/root/falcon
	hadoop fs -copyFromLocal /root/TrainingOnHDP/DataManagementOnFalcon/churnanalysis/workflow.xml /user/root/falcon
	hadoop fs -rm -skipTrash /user/root/falcon/DataManagementOnFalcon-1.0-SNAPSHOT.jar
	hadoop fs -copyFromLocal /root/TrainingOnHDP/DataManagementOnFalcon/target/DataManagementOnFalcon-1.0-SNAPSHOT.jar /user/root/falcon
	
	exit
	
	su - falcon
	
	hadoop fs -mkdir /apps/falcon/SourceCluster
	hadoop fs -mkdir /apps/falcon/SourceCluster/staging
	hadoop fs -mkdir /apps/falcon/SourceCluster/working
	hadoop fs -chmod 777 /apps/falcon/SourceCluster/staging
	hadoop fs -chmod 755 /apps/falcon/SourceCluster/working

	hadoop fs -mkdir /apps/falcon/TransformationCluster
	hadoop fs -mkdir /apps/falcon/TransformationCluster/staging
	hadoop fs -mkdir /apps/falcon/TransformationCluster/working
	hadoop fs -chmod 777 /apps/falcon/TransformationCluster/staging
	hadoop fs -chmod 755 /apps/falcon/TransformationCluster/working

	exit
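
	The `ex` one-liner at the top of step 1.2 strips Windows carriage returns (CRLF -> LF) so the script runs cleanly under the Linux shell. As a minimal sketch of the same fix, here is an equivalent using `sed`, demonstrated on a throwaway file (the path /tmp/crlf_demo.sh is only for illustration):

	```shell
	# Create a throwaway sample script with Windows (CRLF) line endings
	printf 'echo hello\r\necho world\r\n' > /tmp/crlf_demo.sh

	# Strip the trailing carriage return from every line (same effect as the ex one-liner)
	sed -i 's/\r$//' /tmp/crlf_demo.sh

	# Confirm no carriage returns remain
	if grep -q "$(printf '\r')" /tmp/crlf_demo.sh; then echo "CR still present"; else echo "clean"; fi
	# prints "clean"
	```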
	
1.3 Fix the Falcon startup issue

	Log in to the Ambari console and change the Falcon port from 15000 to 15500
	
	wget http://repo.boundlessgeo.com/main/com/sleepycat/je/5.0.84/je-5.0.84.jar 
	
	cp je-5.0.84.jar /usr/hdp/2.6.3.0-235/falcon/webapp/falcon/WEB-INF/lib

	chown falcon:hadoop /usr/hdp/2.6.3.0-235/falcon/webapp/falcon/WEB-INF/lib/je-5.0.84.jar 	
	
	Restart Falcon Server
	
1.4 Make Oozie support Spark 2

	su - oozie
	hdfs dfs -mkdir /user/oozie/share/lib/lib_20171110144231/spark2
	hdfs dfs -put /usr/hdp/2.6.3.0-235/spark2/jars/* /user/oozie/share/lib/lib_20171110144231/spark2/	
	hdfs dfs -cp /user/oozie/share/lib/lib_20171110144231/spark/oozie-sharelib-spark-4.2.0.2.6.3.0-235.jar /user/oozie/share/lib/lib_20171110144231/spark2/
	hdfs dfs -cp /user/oozie/share/lib/lib_20171110144231/spark/hive-site.xml /user/oozie/share/lib/lib_20171110144231/spark2/
	hdfs dfs -put /usr/hdp/2.6.3.0-235/spark2/python/lib/py* /user/oozie/share/lib/lib_20171110144231/spark2/
	oozie admin -sharelibupdate
	oozie admin -shareliblist spark2
	exit
	
2. Define the customerFeed ENTITY using the wizard 
	
	Log in to Ambari as root
	
	Start Falcon Service
	
2.1 Create Cluster

	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as root, click the CREATE button, and choose Cluster. Make sure the Oozie service is up before you save.
	Use the information below to create the cluster:
		
		Cluster Name: SourceCluster
		Colo Name: SourceDC
		Tags: SourceTag		
			  DataIngestion
		
		Replace every <hostname> placeholder with sandbox-hdp.hortonworks.com
		
		Yarn Resource Manager Address: sandbox-hdp.hortonworks.com:8032
		
		Spark: Local
		
	Click SAVE Button
	
2.2 Create Feed  

	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as root, click the CREATE button, and choose Feed. Use the information below to create the entity:

		Feed Name: customerFeed
		Description: customer data feed
		Tag Key: customerFeedTag
		Tag Value: customerFeed
		Feed Groups: churnAnalysisDataPipeline
		Feed Type: HDFS
		Sources:
			Cluster: SourceCluster
			Data Path: /user/root/falcon/customer/input/${YEAR}-${MONTH}-${DAY}-${HOUR}
			Stat Path: /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}
		Repeat Every 60 minutes

	Click SAVE Button	
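
	For reference, Falcon materializes one feed instance per hour by substituting ${YEAR}-${MONTH}-${DAY}-${HOUR} in the data path with the instance's nominal time. A sketch of how a single instance path resolves (the timestamp below is only an example):

	```shell
	# Example nominal time of one hourly feed instance (UTC)
	nominal="2017-12-29 11:00 UTC"

	# Falcon substitutes ${YEAR}-${MONTH}-${DAY}-${HOUR} with the instance time
	instance=$(date -u -d "$nominal" +%Y-%m-%d-%H)

	echo "/user/root/falcon/customer/input/${instance}"
	# prints /user/root/falcon/customer/input/2017-12-29-11
	```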
		
3. Define the customerIngestProcess ENTITY using the wizard

3.1 Create Process

	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as root, click the CREATE button, and choose Process. Use the information below to create the entity:
	
		Process Name: customerIngestProcess
		Tag Key: customerIngestionTag
		Tag Value: customerIngestion
		Workflow Name: customerIngestWorkflow
		Engine: Oozie
		Workflow Path: /user/root/falcon
		Repeat Every 60 minutes
		Cluster: SourceCluster
		
	Advanced Option: Properties	
		
		jobTracker:	sandbox-hdp.hortonworks.com:8032
		nameNode:	hdfs://sandbox-hdp.hortonworks.com:8020
		queueName: default
		
	Use the information below to add OUTPUTS (+):
	
		Name: output
		Feed: customerFeed
		Instance: now(0,0)
		
	Click SAVE Button
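
	The now(0,0) expression is Falcon EL for the feed instance at the process's nominal time, offset by the given hours and minutes. A sketch of how the offset resolves (the timestamp below is only an example):

	```shell
	# Example nominal time of one process instance (UTC)
	nominal="2017-12-29 11:00 UTC"

	# now(h,m) offsets the nominal time by h hours and m minutes,
	# so now(0,0) resolves to the feed instance at the nominal time itself
	h=0; m=0
	resolved=$(date -u -d "$nominal + $h hour + $m minute" +%Y-%m-%d-%H)

	echo "output instance: ${resolved}"
	# prints output instance: 2017-12-29-11
	```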
	
	
4. Define the parsedCustomerFeed ENTITY using the wizard

4.1 Create Cluster

	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as ambari-qa, click the CREATE button, and choose Cluster. Make sure the Oozie service is up before you save.
	Use the information below to create the cluster:
		
		Cluster Name: TransformationCluster
		Colo Name: TransformationDC
		Tags: TransformationTag		
			  DataTransformation
		
		Replace every <hostname> placeholder with sandbox-hdp.hortonworks.com
		
		Yarn Resource Manager Address: sandbox-hdp.hortonworks.com:8032
		
		Spark: Local
	
		
	Click SAVE Button
	
4.2 Create Feed  

	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as root, click the CREATE button, and choose Feed. Use the information below to create the entity:

		Feed Name: parsedCustomerFeed
		Description: parsed customer emails
		Tag Key: parsedCustomerTag
		Tag Value: parsedCustomer
		Feed Groups: churnAnalysisDataPipeline
		Feed Type: HDFS
		
		Enable Replication
		
		Sources:
			Cluster: SourceCluster
			Data Path: /user/root/falcon/customer/output/${YEAR}-${MONTH}-${DAY}-${HOUR}
			Stat Path: /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}

		Targets:
			Cluster: TransformationCluster
			Data Path: /user/root/falcon/customer/output_backup/${YEAR}-${MONTH}-${DAY}-${HOUR}
			Stat Path: /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}
			
		Repeat Every 60 minutes

	Click SAVE Button	
 		
		
5. Define the customerTransformationProcess ENTITY using the wizard
 	
	Open the following URL in your browser to reach the Falcon web UI: http://sandbox-hdp.hortonworks.com:15500/index.html.
	Log in as root, click the CREATE button, and choose Process. Use the information below to create the entity:
	
		Process Name: customerTransformationProcess
		Tag Key: customerTransformationTag
		Tag Value: customerTransformation
		
		Workflow Name: customerTransformationWorkflow
		Engine: Spark
		Workflow Path: /user/root/falcon
		Name: CustomerTransformationApplication
		Application: /user/root/falcon/DataManagementOnFalcon-1.0-SNAPSHOT.jar
		Main Class: ca.training.bigdata.falcon.churn.CustomerEmailTransformer
		Runs on: Local
		Spark Options: --driver-memory 1G --executor-memory 1G
		
		Repeat Every 60 minutes
		Cluster: SourceCluster
		
	Use the information below to add INPUTS (+):
	
		Name: input
		Feed: customerFeed
		Start: now(0,0)
		End: now(0,0)

	Use the information below to add OUTPUTS (+):
	
		Name: output
		Feed: parsedCustomerFeed
		Instance: now(0,0)
		
	Click SAVE Button	

6. Run the feeds

	From the Falcon web UI home page, search for the feeds we created. Select customerFeed by clicking its checkbox,
	then click the Schedule button at the top of the search results. Schedule parsedCustomerFeed in the same way.
	
7. Run the processes
	
	From the Falcon web UI home page, search for the processes we created. Select customerTransformationProcess by clicking its checkbox,
	then click the Schedule button at the top of the search results. Schedule customerIngestProcess in the same way.


8. Check the status	

	If you visit the Oozie process page, you can see the processes running.
	
9. Input and output of the pipeline

	Now that the feeds and processes are running, we can check the ingested input dataset and the produced output dataset on HDFS.
	
10. Use Hive CLI to Create Hive Table

	USE oozie;
	CREATE EXTERNAL TABLE IF NOT EXISTS customer_churning (
		message_id string,
		edate string,
		efrom string,
		eto string,
		subject string,
		cc string,
		mime_type string,
		content_type string,
		content_transfer_encoding string,
		bcc string,
		x_from string,
		x_to string,
		x_cc string,
		x_bcc string,
		x_folder string,
		x_origin string,
		x_filename string)
	STORED AS PARQUET
	LOCATION '/user/root/falcon/customer/output/2017-12-29-11';